[improvement](cgroup) inactive_file should be treated as available memory to avoid query be cancelled#64347
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
…mory to avoid query be cancelled
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 29737 ms |
TPC-DS: Total hot run time: 169589 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
/review |
There was a problem hiding this comment.
I found two cgroup-v2 accounting issues that should be fixed before merge.
Critical checkpoint conclusions:
- Goal: the PR tries to treat active file cache as reclaimable so cgroup memory checks do not reject allocations too early. The goal is only partially met because the new BE accounting can now warn continuously for valid zero counters and can fail refreshes on non-atomic cgroup snapshots.
- Scope: the code change is small and focused, with matching BE and cloud paths for the main active_file subtraction.
- Concurrency/lifecycle: no new shared-state locking or lifecycle ownership was added. However, the cgroup files are read as separate external snapshots while memory/cache state can change concurrently; that race is not handled safely in BE.
- Config/compatibility/persistence/data writes: no new config items, protocol/storage compatibility changes, transaction changes, or data-write paths are involved.
- Parallel paths: BE and cloud both subtract active_file; the blocking issues are in the BE validation/logging path.
- Tests: no dedicated test was added for active_file subtraction, zero-valued metrics, or the negative adjusted-usage edge. Existing PR checks show Linux BE and cloud tests passing; the macOS BE UT failure is an environment JDK-17 mismatch, not a code failure.
- Observability/performance: the new warning can become high-frequency log spam on normal zero file-cache metrics.
User focus: no additional user-provided review focus was supplied.
| int64_t active_file = metrics_map.contains("active_file") ? metrics_map["active_file"] : 0; | ||
| int64_t slab_reclaimable = | ||
| metrics_map.contains("slab_reclaimable") ? metrics_map["slab_reclaimable"] : 0; | ||
| if (inactive_file < 1 || active_file < 1 || slab_reclaimable < 1) { |
There was a problem hiding this comment.
These counters can legitimately be zero; that is different from the metric being absent. For example, a cgroup with only anonymous memory and no file-cache pages reports inactive_file 0 and active_file 0, but refresh_proc_meminfo() calls this path every memory_maintenance_sleep_time_ms (50 ms by default), so this warning would be emitted continuously. Please check key presence separately from the value, and rate-limit any warning for truly missing metrics.
| // the reason why we subtract inactive_file described here: | ||
|
|
||
| const int64_t reclaimable_usage = inactive_file + active_file + slab_reclaimable; | ||
| if (*value < reclaimable_usage) { |
There was a problem hiding this comment.
memory.current and memory.stat are read in separate syscalls, so this sum is not an atomic snapshot. With concurrent page-cache growth, memory.current can be from before the growth while active_file + inactive_file + slab_reclaimable is from after it; returning CgroupError makes MemInfo::refresh_proc_meminfo() drop the cgroup refresh state and temporarily recompute physical/available memory from host /proc/meminfo instead of the cgroup limit. Please handle the race by clamping/retrying rather than treating it as a fatal invariant.
|
run buildall |
TPC-H: Total hot run time: 28728 ms |
TPC-DS: Total hot run time: 169012 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Automated review summary for 6b32cfe.
I reviewed the full PR diff and the existing inline review threads before proposing any new issue. I did not find an additional non-duplicate blocking issue to raise.
Checkpoint conclusions:
- Goal: the PR changes cgroup v2 adjusted memory usage to subtract active_file in BE and cloud meta-service rate-limit accounting. The current code implements that formula.
- Scope: the change is small and focused on cgroup memory accounting plus explanatory comments.
- Concurrency/lifecycle: the BE path is periodically refreshed by the daemon and reads cgroup files via separate syscalls; the previously raised race around memory.current versus memory.stat is already covered by the existing inline thread, so I did not duplicate it.
- Configuration/compatibility: no new config, protocol, storage, transaction, or persistence compatibility concerns.
- Parallel paths: BE and cloud meta-service both received the active_file formula change.
- Error handling/observability: no ignored Status was introduced in the changed code. Existing comments already cover the cgroup metric/race handling concerns.
- Tests: no direct new deterministic test was added for the active_file accounting or clamp path. Existing TeamCity BE UT and Cloud UT statuses are successful; GitHub macOS BE UT failed before tests due runner JDK 25 instead of required JDK 17, and the TeamCity coverage merge status is failing.
- User focus: .code-review.QXaRV6/review_focus.txt contains no additional focus points.
No new inline comments submitted to avoid duplicating the already-known review threads.
|
/review |
There was a problem hiding this comment.
Automated review summary for 6b32cfe.
I reviewed the full PR diff, the repository code-review checklist, the BE common and Cloud meta-service module guides, and the existing inline review threads before proposing any issue. I did not find an additional non-duplicate blocking issue to raise.
Critical checkpoint conclusions:
- Goal: the PR changes cgroup v2 adjusted memory usage to count
active_fileas reclaimable in both BE memory accounting and Cloud meta-service stress-rate-limit accounting. The implementation applies that formula in both paths. - Tests: I did not see a direct new deterministic test for the
active_fileformula or cgroup-stat edge cases. Current PR status shows compile, BE UT, Cloud UT, FE UT, and regression checks passing; GitHub macOS BE UT is failing before meaningful validation due the runner JDK mismatch, and TeamCity coverage merge is failing. - Scope: the patch is small and focused on cgroup memory accounting plus comments.
- Concurrency/lifecycle: no new shared ownership or locking was added. BE still reads
memory.currentandmemory.statas separate external snapshots; that race and the related clamp/log behavior are already covered by the existing inline review thread, so I did not duplicate it. - Configuration/dynamic config: no new config item was added.
- Compatibility/persistence/transactions/data writes: no FE-BE protocol, storage format, EditLog, transaction, or data-write path is involved.
- Parallel paths: BE cgroup v2 and Cloud meta-service cgroup v2 accounting were both updated; cgroup v1 continues to use RSS-only accounting.
- Special conditions/error handling: no new ignored
Statuswas introduced in the changed code. The already-known metric-presence and non-atomic snapshot concerns are represented by existing inline threads. - Observability/performance: no new hot-path allocation or lock issue found in this small patch; any warning behavior around the cgroup snapshot edge is already part of the existing review context.
- User focus:
.code-review.RZf5IY/review_focus.txtcontains no additional focus points, so there was no extra focused issue to report.
I also ran git diff --check against the provided base/head range; it reported no whitespace errors.
|
skip check_coverage |
|
PR approved by at least one committer and no changes requested. |
…mory to avoid query be cancelled (#64347) ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: Sometimes we may see this errors in cgroup or k8s environment. Allocator sys memory check failed: Cannot alloc:5343, ...,process memory used 85.41 GB exceed limit 108.00 GB or sys available memory 5.88 GB less than low water mark 6.00 GB. The mem_limit term is false (85.41 < 108). The 5343-byte allocation is rejected only by sys available memory 5.88 GB < low water mark 6.00 GB. 5.88 GiB available implies cgroup_mem_usage of about 114 GiB, roughly 29 GiB above process memory used (85.41 GiB); that gap is unmapped read page cache. The kernel reclaims clean page cache before OOM, so the memory is available, but Doris cannot reclaim it and the rejection repeats on later allocations. (low water mark 6.00 GB is the default: min(120 - 108, 120 * 5%) = 6.) Before this PR, cgroup_mem_usage = memory.current - inactive_file - slab_reclaimable. So some active files page cache is not treated as recycleable memory. So cgroup_mem_usage is a bit larger than RSS. ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
…mory to avoid query be cancelled (apache#64347) Issue Number: close #xxx Related PR: #xxx Problem Summary: Sometimes we may see this errors in cgroup or k8s environment. Allocator sys memory check failed: Cannot alloc:5343, ...,process memory used 85.41 GB exceed limit 108.00 GB or sys available memory 5.88 GB less than low water mark 6.00 GB. The mem_limit term is false (85.41 < 108). The 5343-byte allocation is rejected only by sys available memory 5.88 GB < low water mark 6.00 GB. 5.88 GiB available implies cgroup_mem_usage of about 114 GiB, roughly 29 GiB above process memory used (85.41 GiB); that gap is unmapped read page cache. The kernel reclaims clean page cache before OOM, so the memory is available, but Doris cannot reclaim it and the rejection repeats on later allocations. (low water mark 6.00 GB is the default: min(120 - 108, 120 * 5%) = 6.) Before this PR, cgroup_mem_usage = memory.current - inactive_file - slab_reclaimable. So some active files page cache is not treated as recycleable memory. So cgroup_mem_usage is a bit larger than RSS. None - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Sometimes we may see this errors in cgroup or k8s environment. Allocator sys memory check failed: Cannot alloc:5343, ...,process memory used 85.41 GB exceed limit 108.00 GB or sys available memory 5.88 GB less than low water mark 6.00 GB.
The mem_limit term is false (85.41 < 108). The 5343-byte allocation is rejected only by sys available memory 5.88 GB < low water mark 6.00 GB. 5.88 GiB available implies cgroup_mem_usage of about 114 GiB, roughly 29 GiB above process memory used (85.41 GiB); that gap is unmapped read page cache. The kernel reclaims clean page cache before OOM, so the memory is available, but Doris cannot reclaim it and the rejection repeats on later allocations. (low water mark 6.00 GB is the default: min(120 - 108, 120 * 5%) = 6.)
Before this PR, cgroup_mem_usage = memory.current - inactive_file - slab_reclaimable. So some active files page cache is not treated as recycleable memory. So cgroup_mem_usage is a bit larger than RSS.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)